Abstract. Define Minimum Soapy Union (MinSU) as the following
optimization problem: given a $k$-tuple $(X_1, X_2, \ldots, X_k)$ of finite integer
sets, find a $k$-tuple $(t_1, t_2, \ldots, t_k)$ of integers that minimizes the
cardinality of $(X_1 + t_1) \cup (X_2 + t_2) \cup \cdots \cup (X_k + t_k)$. We show that MinSU
is NP-complete, APX-hard, and polynomial for fixed $k$.
MinSU appears naturally in the context of protein shotgun sequencing: 
Here, the protein is cleaved into short and overlapping peptides, which 
are then analyzed by tandem mass spectrometry. To improve the qual- 
ity of such spectra, one then asks for the mass of the unknown prefix 
(the shift) of the spectrum, such that the resulting shifted spectra show 
a maximum agreement. For real-world data the problem is even more
complicated than our definition of MinSU; but our intractability results
clearly indicate that a polynomial-time algorithm for shotgun protein
sequencing is unlikely to exist.



1 Introduction 

The aim of this paper is to study the computational complexity of the following 
optimization problem: 

Name: Minimum Soapy Union (MinSU) 

Input: a finite set $A$ and an indexed family $(X_a)_{a \in A}$ of non-empty finite sets of
rational integers.

Solution: an indexed family $(t_a)_{a \in A}$ of rational integers.

Measure: the cardinality of $\bigcup_{a \in A} (X_a + t_a)$.
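To make the measure concrete, here is a minimal Python sketch. The function name `soapy_union_measure` and the dictionary-based encoding of the families $(X_a)_{a \in A}$ and $(t_a)_{a \in A}$ are illustrative assumptions, not notation from the paper.

```python
from typing import Dict, Hashable, Set


def soapy_union_measure(sets: Dict[Hashable, Set[int]],
                        shifts: Dict[Hashable, int]) -> int:
    """Return the cardinality of the union of the shifted sets X_a + t_a."""
    union = set()
    for a, x_a in sets.items():
        # Shift every element of X_a by t_a and add it to the union.
        union.update(x + shifts[a] for x in x_a)
    return len(union)


if __name__ == "__main__":
    sets = {"a": {0, 3, 7}, "b": {10, 13, 18}}
    print(soapy_union_measure(sets, {"a": 0, "b": 0}))   # 6: the sets are disjoint
    print(soapy_union_measure(sets, {"a": 10, "b": 0}))  # 4: {10, 13, 17, 18}
```

In this toy instance, shifting the first set by 10 makes two of its elements coincide with elements of the second set, which lowers the measure from 6 to 4.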

Let us name Soapy Union (SU) the decision problem associated with 
MinSU. The names have been chosen by analogy with the Soapy Set Cover 
problem [2]. Clearly, SU is a number problem [11]. MinSU can be seen as a
generalization of the Subset Matching problem [8]: optimally solving Subset
Matching is equivalent to optimally solving the restriction of MinSU to those
instances $(X_a)_{a \in A}$ such that the cardinality of $A$ equals 2.

MinSU naturally appears in the context of protein shotgun sequencing
[6, 5, 4]. (This problem must not be confused with the more widely known
peptide shotgun sequencing.) Sequencing the protein means that we want to 
determine its amino acid sequence. We assume that no genomic information is 
available for the protein, so that its sequence cannot be derived from the genomic
information. This is the case for many proteins even in humans, monoclonal
antibodies being an important example [5]. Experimentally, the protein is cleaved
into short and overlapping peptides, which are then analyzed by tandem mass 
spectrometry. To improve the quality of such spectra, one then asks for the mass 
of the unknown prefix (the shift) of the spectrum, such that the resulting shifted 
spectra show a maximum agreement. For real-world data the problem is even 
more complicated than our definition of MinSU; but our intractability results 
clearly indicate that a polynomial-time algorithm for shotgun protein
sequencing is unlikely to exist.

Contribution. In Section 2 we prove that SU belongs to NP and that MinSU
can be solved in polynomial time for fixed $A$. In Section 3 we show that SU is
strongly NP-hard; furthermore, we prove that there exists a real number $\rho > 1$
such that if MinSU is $\rho$-approximable in pseudo-polynomial time then P = NP.

Notation and definitions. For every finite set $S$, $|S|$ denotes the cardinality of $S$.
For all sets $A$ and $S$, $S^A$ denotes the set of all families of elements of $S$ indexed
by $A$.

The ring of rational integers is denoted $\mathbb{Z}$. For every integer $n > 0$, $[1, n]$
denotes the set of all $k \in \mathbb{Z}$ such that $1 \le k \le n$.

A(n undirected) graph is a pair $G = (V, E)$, where $V$ is a finite set and $E$
is a set of 2-element subsets of $V$: the elements of $V$ are the vertices of $G$, the
elements of $E$ are the edges of $G$, and for each edge $e \in E$, the elements of $e$ are
the extremities of $e$.

Let Min be a minimization problem. The decision problem associated with
Min is: given an instance $I$ of Min and an integer $k > 0$, decide whether there
exists a solution of Min on $I$ with measure at most $k$.



2 Membership 

For each instance $(X_a)_{a \in A}$ of MinSU, the set of all feasible solutions of MinSU
on $(X_a)_{a \in A}$ equals $\mathbb{Z}^A$, which is infinite. Therefore, MinSU is not an
NP-optimization problem [2], and thus the membership of SU in NP is not completely
trivial.

Let $G = (V, E)$ be a graph. A disconnection of $G$ is a pair $(B, C)$ such that
$B \ne \emptyset$, $C \ne \emptyset$, $B \cap C = \emptyset$, $V = B \cup C$, and for every $(b, c) \in B \times C$, $\{b, c\} \notin E$.
A graph is called disconnected if it admits a disconnection. A graph that is not 
disconnected is called connected. 

Let $(Y_a)_{a \in A}$ be an indexed family of sets. The intersection graph of $(Y_a)_{a \in A}$
is defined as follows: its vertex set equals $A$ and for all $b, c \in A$ with $b \ne c$, $\{b, c\}$
is one of its edges if, and only if, $Y_b \cap Y_c \ne \emptyset$.

Lemma 1. Let $(Y_a)_{a \in A}$ be an instance of MinSU. If the intersection graph of
$(Y_a)_{a \in A}$ is disconnected then there exists $(u_a)_{a \in A} \in \mathbb{Z}^A$ such that

$$\left| \bigcup_{a \in A} (Y_a + u_a) \right| < \left| \bigcup_{a \in A} Y_a \right| . \tag{1}$$

Proof. For each subset $B \subseteq A$, put $Y_B = \bigcup_{b \in B} Y_b$. Let $(B, C)$ be a disconnection
of the intersection graph of $(Y_a)_{a \in A}$. Let $r \in Y_B$ and $s \in Y_C$ be fixed. Set
$u_b = -r$ for every $b \in B$ and $u_c = -s$ for every $c \in C$. On the one hand, we
have $Y_B \cap Y_C = \emptyset$ and $Y_A = Y_B \cup Y_C$, so

$$|Y_A| = |Y_B| + |Y_C| . \tag{2}$$

On the other hand, we have

$$\bigcup_{a \in A} (Y_a + u_a) = (Y_B - r) \cup (Y_C - s)$$

and

$$(Y_B - r) \cap (Y_C - s) \ne \emptyset$$

because $0 \in (Y_B - r) \cap (Y_C - s)$; it follows

$$\left| \bigcup_{a \in A} (Y_a + u_a) \right| < |Y_B| + |Y_C| . \tag{3}$$

It now suffices to combine Equations (2) and (3) to obtain Equation (1). □

Lemma 1 can be restated as follows:

Lemma 2. Let $(X_a)_{a \in A}$ be an instance of MinSU. For any optimum solution
$(t_a)_{a \in A}$ of MinSU on $(X_a)_{a \in A}$, the intersection graph of $(X_a + t_a)_{a \in A}$ is
connected.

Proof. Let $(t_a)_{a \in A} \in \mathbb{Z}^A$ be such that the intersection graph of $(X_a + t_a)_{a \in A}$
is disconnected. Set $Y_a = X_a + t_a$ for each $a \in A$. By Lemma 1 there exists
$(u_a)_{a \in A} \in \mathbb{Z}^A$ such that Equation (1) holds. It follows that $(t_a + u_a)_{a \in A}$ is a
better solution of MinSU on $(X_a)_{a \in A}$ than $(t_a)_{a \in A}$. □
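The argument of Lemma 1 is constructive, and the following Python sketch illustrates it under some assumptions: the helper names (`components`, `merge_shifts`, `union_size`) are hypothetical, and instead of a single disconnection $(B, C)$ the sketch treats every connected component of the intersection graph at once, shifting each one so that one of its elements lands on 0.

```python
def union_size(sets, shifts):
    """Cardinality of the union of the shifted sets."""
    return len({x + shifts[a] for a, xs in sets.items() for x in xs})


def components(sets):
    """Connected components of the intersection graph of the family `sets`."""
    remaining = set(sets)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            b = stack.pop()
            for c in list(remaining):
                if sets[b] & sets[c]:        # {b, c} is an edge
                    remaining.remove(c)
                    comp.add(c)
                    stack.append(c)
        comps.append(comp)
    return comps


def merge_shifts(sets):
    """Shift each component so that its least element becomes 0.

    With two components B and C this is the choice u_b = -r, u_c = -s
    of Lemma 1, where r and s are the least elements of Y_B and Y_C.
    """
    shifts = {}
    for comp in components(sets):
        anchor = min(min(sets[a]) for a in comp)
        for a in comp:
            shifts[a] = -anchor
    return shifts


if __name__ == "__main__":
    family = {1: {0, 1}, 2: {1, 2}, 3: {10, 12}}
    u = merge_shifts(family)
    print(union_size(family, {a: 0 for a in family}))  # 5
    print(union_size(family, u))                       # 3
```

After the shift, every component contains 0, so the shifted family has a connected intersection graph and a strictly smaller union whenever the original graph was disconnected.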

Definition 1. Let $G = (V, E)$ be a graph. Put

$$\vec{E} = \{(a, b) \in V \times V : \{a, b\} \in E\} .$$

An antisymmetric edge-weight function on $G$ is a function $w$ from $\vec{E}$ to $\mathbb{Z}$
such that $w(b, c) = -w(c, b)$ for every $(b, c) \in \vec{E}$. For each antisymmetric
edge-weight function $w$ on $G$, define $S(G, w)$ as the set of all $(t_a)_{a \in V} \in \mathbb{Z}^V$ such that
$t_b - t_c = w(b, c)$ for all $(b, c) \in \vec{E}$.

Let us comment on Definition 1. The function $w$ assigns both a magnitude and
an orientation to each edge of $G$: for all $a, b \in V$ such that $\{a, b\} \in E$, the
magnitude of $\{a, b\}$ is the absolute value of $w(a, b)$ and the orientation of $\{a, b\}$
is determined by the sign of $w(a, b)$. It is clear that for every $(t_a)_{a \in V} \in S(G, w)$
and every $u \in \mathbb{Z}$, $(t_a + u)_{a \in V} \in S(G, w)$. If $G$ is connected then either $S(G, w)$ is
empty or there exists $(t_a)_{a \in V} \in \mathbb{Z}^V$ such that $S(G, w) = \{(t_a + u)_{a \in V} : u \in \mathbb{Z}\}$.
If $G$ is connected and $S(G, w) \ne \emptyset$ then for any $(b, u) \in V \times \mathbb{Z}$, the unique element
$(t_a)_{a \in V} \in S(G, w)$ that satisfies $t_b = u$ is computable from $G$, $w$, $b$, and $u$ in
polynomial time. A closed walk in $G$ is a finite sequence $(a_0, a_1, a_2, \ldots, a_k)$ such that
$a_0 = a_k$ and $\{a_{i-1}, a_i\} \in E$ for every $i \in [1, k]$; the weight of $(a_0, a_1, a_2, \ldots, a_k)$
under $w$ is defined as $w(a_0, a_1) + w(a_1, a_2) + \cdots + w(a_{k-1}, a_k)$. A (simple) cycle
in $G$ is a closed walk $(a_0, a_1, a_2, \ldots, a_k)$ in $G$ such that for all $i, j \in [1, k]$, $a_i = a_j$
implies $i = j$. The following three conditions are equivalent:

1. The set $S(G, w)$ is non-empty.

2. The weight under w of every closed walk in G equals 0. 

3. The weight under w of every cycle in G equals 0. 

The second and third conditions can be thought of as abstract forms of Kirchhoff's
voltage law.
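For a connected graph, an element of $S(G, w)$ can be obtained by propagating values from one vertex along the edges. The Python sketch below illustrates this under stated assumptions: the function name `solve_potentials` and the input encoding (one orientation per edge, stored in a dictionary) are hypothetical choices made for the example.

```python
from collections import deque


def solve_potentials(vertices, weight, root, value=0):
    """Return the element (t_a) of S(G, w) with t_root = value, or None.

    `weight` maps an ordered pair (b, c) to w(b, c), meaning t_b - t_c = w(b, c);
    the opposite orientation is filled in by antisymmetry.
    The graph is assumed to be connected.
    """
    adj = {v: [] for v in vertices}
    for (b, c), w_bc in weight.items():
        adj[b].append((c, w_bc))
        adj[c].append((b, -w_bc))       # w(c, b) = -w(b, c)
    t = {root: value}
    queue = deque([root])
    while queue:
        b = queue.popleft()
        for c, w_bc in adj[b]:
            t_c = t[b] - w_bc           # from t_b - t_c = w(b, c)
            if c not in t:
                t[c] = t_c
                queue.append(c)
            elif t[c] != t_c:
                return None             # some closed walk has non-zero weight
    if len(t) != len(adj):
        raise ValueError("the graph is not connected")
    return t


if __name__ == "__main__":
    V = [1, 2, 3]
    print(solve_potentials(V, {(1, 2): 2, (2, 3): 3, (1, 3): 5}, root=1))
    # {1: 0, 2: -2, 3: -5}
    print(solve_potentials(V, {(1, 2): 2, (2, 3): 3, (1, 3): 4}, root=1))
    # None: the triangle has weight 2 + 3 - 4 = 1
```

The breadth-first traversal visits each edge a constant number of times, which is consistent with the polynomial-time claim above; the `None` case corresponds to a violation of the second and third conditions.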

A tree is a connected graph with one fewer edges than vertices, or equivalently,
an acyclic connected graph. An arbitrary graph $G = (V, E)$ is connected
if, and only if, there exists a subset $E' \subseteq E$ such that $(V, E')$ is a tree ($(V, E')$ is
then called a spanning tree of $G$).

Lemma 3. Let $(X_a)_{a \in A}$ be an instance of MinSU. There exist a tree $H$ with
vertex set $A$ and an antisymmetric edge-weight function $w$ on $H$ that satisfy the
following two conditions:

1. Every integer in the range of $w$ can be written as the difference of two
elements of $\bigcup_{a \in A} X_a$.

2. Every element of $S(H, w)$ is an optimum solution of MinSU on $(X_a)_{a \in A}$.

Proof. Let $(t_a)_{a \in A}$ be an optimum solution of MinSU on $(X_a)_{a \in A}$. Let $H$ be
a spanning tree of the intersection graph of $(X_a + t_a)_{a \in A}$: such a tree exists by
Lemma 2. Let $w$ be the antisymmetric edge-weight function on $H$ defined by:
for all $b, c \in A$ such that $\{b, c\}$ is an edge of $H$, $w(b, c) = t_b - t_c$.

For all $b, c \in A$ such that $\{b, c\}$ is an edge of the intersection graph of
$(X_a + t_a)_{a \in A}$, $(X_b + t_b) \cap (X_c + t_c)$ is non-empty, and thus $t_b - t_c$ belongs to
$X_c - X_b$. Therefore, the first condition holds. Now, remark that $S(H, w) =
\{(t_a + u)_{a \in A} : u \in \mathbb{Z}\}$, so the second condition holds. □

Theorem 1. SU belongs to NP. 

Proof. Let $((X_a)_{a \in A}, k)$ be an arbitrary instance of SU. We propose the
following (non-deterministic) algorithm to decide whether $((X_a)_{a \in A}, k)$ is a
yes-instance of SU:

- Guess a tree $H$ with vertex set $A$ and an antisymmetric edge-weight function
$w$ on $H$ such that the first condition of Lemma 3 holds.

- Compute an element $(t_a)_{a \in A} \in S(H, w)$.

- Check whether the cardinality of $\bigcup_{a \in A} (X_a + t_a)$ is at most $k$.

By Lemma 3 the algorithm is correct. Moreover, the bit-length of the guess (i.e.,
the ordered pair $(H, w)$) is polynomial in the bit-length of the input (i.e., the
instance $(X_a)_{a \in A}$) because every weight is the difference of two integers of the
input, so the algorithm can be implemented in non-deterministic
polynomial time. □

Let $m$ be a positive integer and let $X$ be a subset of $\mathbb{Z}$ such that $X = -X$.
On each given $m$-edge graph, there are exactly $|X|^m$ distinct antisymmetric
edge-weight functions whose ranges are subsets of $X$.

Let $n$ be a positive integer and let $\mathcal{T}_n$ denote the set of all trees with vertex
set $[1, n]$. Cayley's formula ensures $|\mathcal{T}_n| = n^{n-2}$ [12]. Moreover, every tree can be
reconstructed in polynomial time from its Prüfer code [12], so $\mathcal{T}_n$ is enumerable
in $O(n^{O(n)})$ time.

Theorem 2. There exists an algorithm that, for each instance $(X_a)_{a \in A}$ of
MinSU given as input, returns an optimum solution of MinSU on $(X_a)_{a \in A}$
in $O(N^{O(|A|)})$ time, where $N$ denotes the bit-length of $(X_a)_{a \in A}$.

Proof. Put $U = \bigcup_{a \in A} X_a$. Let $\mathcal{H}$ denote the set of all ordered pairs of the form
$(H, w)$, where $H$ is a tree with vertex set $A$ and $w$ is an antisymmetric
edge-weight function on $H$ whose range is a subset of $U - U$. We propose the following
algorithm to solve MinSU on $(X_a)_{a \in A}$:

- For each $(H, w) \in \mathcal{H}$, compute an element of $S(H, w)$.

- Return a best solution of MinSU on $(X_a)_{a \in A}$ among those computed at the
previous step.
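The two steps above can be sketched in Python as follows. This is an illustrative brute-force sketch rather than a tuned implementation: the function names are hypothetical, trees are enumerated through their Prüfer codes, and the example is only practical for very small instances.

```python
from itertools import product


def prufer_to_edges(code, labels):
    """Decode a Prüfer code (indices into `labels`) into the edges of a tree."""
    n = len(labels)
    degree = [1] * n
    for v in code:
        degree[v] += 1
    edges = []
    for v in code:
        leaf = min(i for i in range(n) if degree[i] == 1)
        edges.append((labels[leaf], labels[v]))
        degree[leaf] -= 1
        degree[v] -= 1
    u, v = [i for i in range(n) if degree[i] == 1]
    edges.append((labels[u], labels[v]))
    return edges


def shifts_from_tree(edges, weights, root):
    """Element of S(H, w) with t_root = 0, where t_b - t_c = w on edge (b, c)."""
    adj = {}
    for (b, c), w in zip(edges, weights):
        adj.setdefault(b, []).append((c, w))
        adj.setdefault(c, []).append((b, -w))
    t, stack = {root: 0}, [root]
    while stack:
        b = stack.pop()
        for c, w in adj.get(b, []):
            if c not in t:
                t[c] = t[b] - w
                stack.append(c)
    return t


def min_soapy_union(sets):
    """Return (shifts, measure) for an optimum solution, by exhaustive search."""
    labels = list(sets)
    if len(labels) == 1:
        return {labels[0]: 0}, len(sets[labels[0]])
    universe = set().union(*sets.values())
    diffs = sorted({x - y for x in universe for y in universe})  # U - U
    best, best_size = None, None
    for code in product(range(len(labels)), repeat=len(labels) - 2):
        edges = prufer_to_edges(code, labels)
        for weights in product(diffs, repeat=len(edges)):
            t = shifts_from_tree(edges, weights, labels[0])
            size = len({x + t[a] for a in labels for x in sets[a]})
            if best_size is None or size < best_size:
                best, best_size = t, size
    return best, best_size


if __name__ == "__main__":
    print(min_soapy_union({"a": {0, 3, 7}, "b": {10, 13, 18}})[1])  # 4
```

Because $U - U$ is symmetric, the orientation chosen for each decoded edge does not matter: every antisymmetric edge-weight function with range in $U - U$ is visited.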

By Lemma 3 the algorithm returns an optimum solution of MinSU on $(X_a)_{a \in A}$.

Moreover, remark that $|\mathcal{H}| = |A|^{|A|-2} \, |U - U|^{|A|-1}$ and that $\mathcal{H}$ is enumerable
in $O(N^{O(|A|)})$ time. Therefore, the algorithm can be implemented to run in
$O(N^{O(|A|)})$ time. □



3 Hardness 

The aim of this section is to prove the hardness results for MinSU.

Let $G = (V, E)$ be a graph. A vertex cover of $G$ is a subset $C \subseteq V$ such that
$C \cap e \ne \emptyset$ for every $e \in E$: a vertex cover is a subset of vertices that contains at
least one extremity of each edge.

Name: Minimum Vertex Cover (MinVC) 

Input: a graph G. 

Solution: a vertex cover C of G. 

Measure: the cardinality of C . 

The decision problem associated with MinVC is named Vertex Cover (VC). 
It is well-known that VC is NP-complete [11].

To prove that SU is (strongly) NP-complete, we show that VC Karp-reduces 
to (a suitable restriction of) SU. The following gadget plays a crucial role in 
our reduction as well as in other reductions that can be found in the literature.



Definition 2. For each integer $n \ge 1$, define $R_n = \{(i - 1)n^2 + i^2 : i \in [1, n]\}$.

A Golomb ruler [10, 15, 3] is a finite subset $R \subseteq \mathbb{Z}$ that satisfies the following
three equivalent conditions:

- For every $t \in \mathbb{Z}$, $t \ne 0$ implies $|R \cap (R + t)| \le 1$.

- For every integer $d > 0$, there exists at most one $(r, s) \in R \times R$ such that
$r - s = d$.

- For all $r_1, r_2, s_1, s_2 \in R$, $r_1 + r_2 = s_1 + s_2$ implies $\{r_1, r_2\} = \{s_1, s_2\}$.

Actually, only the first condition is referred to in what follows. Among other
convenient properties our gadget sets are Golomb rulers:

Lemma 4. Let $n$ be a positive integer. The following four properties hold.

1. The least element of $R_n$ is $1$ and the greatest element of $R_n$ is $n^3$.

2. The cardinality of $R_n$ equals $n$.

3. The distance between any two distinct elements of $R_n$ is at least $n^2 + 3$.

4. $R_n$ is a Golomb ruler.

Proof. Properties 1 and 2 are clear. Proofs of Property 4 can be found in [14, 13].
Finally, remark that for every $i \ge 1$, we have

$$\left(i n^2 + (i + 1)^2\right) - \left((i - 1)n^2 + i^2\right) = n^2 + 2i + 1 \ge n^2 + 3 .$$

Hence, the distance between any two consecutive elements of $R_n$ is
at least $n^2 + 3$, and thus Property 3 holds. □
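The four properties of Lemma 4 are easy to check numerically for small $n$. The following Python sketch does so; the function names `gadget`, `is_golomb_ruler`, and `check_lemma_4` are hypothetical, and the Golomb test uses the second of the three equivalent conditions (all positive differences are pairwise distinct).

```python
from itertools import combinations


def gadget(n):
    """The gadget set R_n = {(i - 1) n^2 + i^2 : 1 <= i <= n}, in increasing order."""
    return [(i - 1) * n * n + i * i for i in range(1, n + 1)]


def is_golomb_ruler(ruler):
    """True if all positive differences between elements are pairwise distinct."""
    diffs = [b - a for a, b in combinations(sorted(ruler), 2)]
    return len(diffs) == len(set(diffs))


def check_lemma_4(n):
    """Check the four properties of Lemma 4 for a given n."""
    r = gadget(n)
    extremes_ok = r[0] == 1 and r[-1] == n ** 3
    size_ok = len(set(r)) == n
    # Consecutive gaps are the smallest distances since r is increasing.
    gaps_ok = all(b - a >= n * n + 3 for a, b in zip(r, r[1:]))
    return extremes_ok and size_ok and gaps_ok and is_golomb_ruler(r)


if __name__ == "__main__":
    print(gadget(3))                                    # [1, 13, 27]
    print(all(check_lemma_4(n) for n in range(1, 30)))  # True
```

Such a check does not replace the cited proofs of Property 4; it only confirms the statement on a finite range.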

Theorem 3. SU is strongly NP-hard. 

Proof. Put $f(x) = \left(\frac{1}{4}x + 2\right)^3 + \frac{1}{2}x - 4$. Let Aux denote the restriction of SU
to those instances $((X_a)_{a \in A}, k)$ such that the absolute value of every integer
in $\{k\} \cup \bigcup_{a \in A} X_a$ is at most $f(\max_{a \in A} |X_a|)$. We prove that Aux is NP-hard,
which implies the theorem. More precisely, we show that VC Karp-reduces to
Aux.

Presentation of the reduction. Let $I$ be an arbitrary instance of VC. The reduction
maps $I$ to an instance $J$ of SU that is defined as follows. Let $G$, $V$, $E$, and
$k$ be such that $I = (G, k)$ and $G = (V, E)$. Let $n$ denote the cardinality of $V$.
Without loss of generality, we may assume $V = [1, n]$ and $k < n$ because $I$ is
a yes-instance of VC whenever $k \ge n$. Let $(y_e)_{e \in E}, (z_e)_{e \in E} \in V^E$ be such that
$e = \{y_e, z_e\}$ for every $e \in E$. Set

$$A = \{\circledast\} \cup E ,$$
$$s = (n + 4)^3 ,$$
$$R = R_{n+4} ,$$
$$X_\circledast = (V - s - n) \cup (R - s) \cup (R + n) \cup (V + s + n) ,$$
$$X_e = \{z_e - n\} \cup R \cup \{y_e + s\}$$

for each $e \in E$, and

$$J = \left((X_a)_{a \in A}, |X_\circledast| + k\right) .$$

Clearly, $J$ is computable from $I$ in polynomial time.
An instance of Aux. Let us prove that $J$ is in fact an instance of Aux. With
the help of Lemmas 4.1 and 4.2 it is easy to see that $1 - s - n$ is the least
element of $\bigcup_{a \in A} X_a$, that $s + 2n$ is the greatest element of $\bigcup_{a \in A} X_a$, and that
the cardinality of $X_\circledast$ equals $4n + 8$. The latter property implies $|X_\circledast| + k < 5n + 8$.
Hence, the absolute value of every integer in $\{|X_\circledast| + k\} \cup \bigcup_{a \in A} X_a$ is at most
$s + 2n$. Now, remark that $s + 2n = f(|X_\circledast|) \le f(\max_{a \in A} |X_a|)$.
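The construction of $J$ and the size claims of the previous paragraph can be illustrated with a short Python sketch. Several choices here are assumptions made for the example: the function names, the string `"*"` used as the special index of $A$, and the encoding of an edge $e$ as the pair $(y_e, z_e)$.

```python
from fractions import Fraction


def gadget(n):
    """R_n = {(i - 1) n^2 + i^2 : 1 <= i <= n}."""
    return [(i - 1) * n * n + i * i for i in range(1, n + 1)]


def f(x):
    """The bound f(x) = (x/4 + 2)^3 + x/2 - 4, computed exactly."""
    return (Fraction(x, 4) + 2) ** 3 + Fraction(x, 2) - 4


def build_instance(n, edges, k):
    """Map a VC instance (vertices 1..n, edge list, k) to an SU instance.

    Returns (sets, bound), where sets["*"] is the special set and
    sets[(y, z)] is the set attached to the edge {y, z}.
    """
    s = (n + 4) ** 3
    ruler = gadget(n + 4)
    vertices = range(1, n + 1)
    x_star = ({v - s - n for v in vertices} | {r - s for r in ruler}
              | {r + n for r in ruler} | {v + s + n for v in vertices})
    sets = {"*": x_star}
    for (y, z) in edges:
        sets[(y, z)] = {z - n} | set(ruler) | {y + s}
    return sets, len(x_star) + k


if __name__ == "__main__":
    n, k = 3, 2
    sets, bound = build_instance(n, [(1, 2), (2, 3), (1, 3)], k)  # a triangle
    s = (n + 4) ** 3
    everything = set().union(*sets.values())
    print(len(sets["*"]) == 4 * n + 8)                      # True
    print(min(everything) == 1 - s - n)                     # True
    print(max(everything) == s + 2 * n == f(4 * n + 8))     # True
```

For the triangle, $s = 343$, the special set has 20 elements, and the largest integer of the instance is $349 = f(20)$, in line with the bound used to define Aux.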

Correctness of the reduction. It remains to prove that $I$ is a yes-instance of VC
if, and only if, $J$ is a yes-instance of SU.

Lemma 5. For every $e \in E$, it holds true that

1. $(X_e - s) \setminus X_\circledast = \{y_e\}$ and that

2. $(X_e + n) \setminus X_\circledast = \{z_e\}$.

Proof. We only prove Property 1 because Property 2 can be proven in the same
way. Put $Y = \{z_e - s - n\} \cup (R - s)$. It is clear that $X_e - s = Y \cup \{y_e\}$ and
$Y \subseteq X_\circledast$. Therefore, we have

$$(X_e - s) \setminus X_\circledast = \{y_e\} \setminus X_\circledast . \tag{4}$$

Moreover, it follows from Lemma 4.1 that

- the greatest element of $(V - s - n) \cup (R - s)$ equals $0$ and that

- the least element of $(R + n) \cup (V + s + n)$ equals $n + 1$.

Therefore, $X_\circledast$ does not contain any element of $[1, n]$. In particular, $y_e$ does not
belong to $X_\circledast$. Combining the latter fact with Equation (4), we obtain Property 1. □

Lemma 6. For every $t \in \mathbb{Z}$, $|(R + t) \setminus X_\circledast| < n$ implies $t \in \{-s, +n\}$.

Proof. Let us first bound from above the cardinality of $(R + t) \cap X_\circledast$. For each
$\tau \in \mathbb{Z}$, put $P_\tau = (R + t) \cap (V + \tau)$ and $Q_\tau = (R + t) \cap (R + \tau)$. First, it follows
from Lemma 4.3 that $|P_\tau| \le 1$. Second, $\tau \ne t$ implies $|Q_\tau| \le 1$ by Lemma 4.4.
And third, it holds that

$$(R + t) \cap X_\circledast = P_{-s-n} \cup Q_{-s} \cup Q_n \cup P_{s+n} .$$

Now, assume $t \notin \{-s, +n\}$. From the preceding three facts, we deduce that

$$|(R + t) \cap X_\circledast| \le |P_{-s-n}| + |Q_{-s}| + |Q_n| + |P_{s+n}| \le 4 .$$

(In fact, it is not hard to see that $|(R + t) \cap X_\circledast| \le 2$ holds: $t > +n$ implies
$P_{-s-n} = Q_{-s} = \emptyset$, $-s < t < +n$ implies $P_{-s-n} = P_{s+n} = \emptyset$, and $t < -s$ implies
$Q_n = P_{s+n} = \emptyset$.) Since $|R + t| = n + 4$ by Lemma 4.2, we finally get

$$|(R + t) \setminus X_\circledast| = n + 4 - |(R + t) \cap X_\circledast| \ge n .$$

□
(If). Assume that $I$ is a yes-instance of VC. Then, there exists a vertex cover
$C$ of $G$ with $|C| \le k$. Put $F = \{e \in E : y_e \in C\}$. Set $t_\circledast = 0$, $t_e = -s$ for each
$e \in F$, and $t_e = +n$ for each $e \in E \setminus F$. On the one hand, it holds that

$$\left| \bigcup_{a \in A} (X_a + t_a) \right| = |X_\circledast| + \left| \bigcup_{e \in E} (X_e + t_e) \setminus X_\circledast \right| \tag{5}$$

because $t_\circledast = 0$. On the other hand, it follows from Lemma 5 that

$$\bigcup_{e \in E} (X_e + t_e) \setminus X_\circledast = \{y_e : e \in F\} \cup \{z_e : e \in E \setminus F\} . \tag{6}$$

Since the right-hand side of Equation (6) is a subset of $C$, we have

$$\left| \bigcup_{e \in E} (X_e + t_e) \setminus X_\circledast \right| \le k . \tag{7}$$

We then get

$$\left| \bigcup_{a \in A} (X_a + t_a) \right| \le |X_\circledast| + k \tag{8}$$

by combining Equations (5) and (7). Hence, $J$ is a yes-instance of SU.



(Only if). Assume that $J$ is a yes-instance of SU. Then, there exists $(t_a)_{a \in A} \in
\mathbb{Z}^A$ such that Equation (8) holds. Replacing $(t_a)_{a \in A}$ with $(t_a - t_\circledast)_{a \in A}$ leaves the
cardinality of $\bigcup_{a \in A} (X_a + t_a)$ unchanged; therefore, we may assume that $t_\circledast = 0$;
in particular, Equation (5) holds.
Put

$$C = \bigcup_{e \in E} (X_e + t_e) \setminus X_\circledast .$$

Combining Equations (5) and (8), we obtain Equation (7), or equivalently, $|C| \le
k$. Now, let us prove that $C$ is a vertex cover of $G$. Consider an arbitrary edge $e \in
E$. Since we have

$$(R + t_e) \setminus X_\circledast \subseteq (X_e + t_e) \setminus X_\circledast \subseteq C$$

and $|C| \le k < n$, it follows from Lemma 6 that $t_e \in \{-s, +n\}$. Consequently, Lemma 5 ensures
that some extremity of $e$ belongs to $(X_e + t_e) \setminus X_\circledast$, and this extremity is a fortiori
in $C$. Hence, $I$ is a yes-instance of VC. □

A graph $G = (V, E)$ is called cubic if for every vertex $v \in V$, the degree
of $v$ in $G$ (i.e., the cardinality of $\{w \in V : \{v, w\} \in E\}$) equals 3. Let MinVC3
denote the restriction of MinVC to cubic graphs. MinVC3 is APX-complete
under L-reduction [1]; moreover, if MinVC3 is $\frac{100}{99}$-approximable in polynomial
time then P = NP [7].

To prove that MinSU is "strongly" APX-hard, which is a better result than
Theorem 3, we show that MinVC3 L-reduces to a suitable restriction of MinSU.
In fact, we simply adapt the proof of Theorem 3.

Theorem 4. There exists a real constant $\rho > 1$ such that if MinSU is
$\rho$-approximable in pseudo-polynomial time then P = NP.



Proof. Let $f$ be as in the proof of Theorem 3 and let MinAux denote the
restriction of MinSU to those instances $(X_a)_{a \in A}$ such that the absolute value of
every integer in $\bigcup_{a \in A} X_a$ is at most $f(\max_{a \in A} |X_a|)$. We prove that MinAux
is APX-hard, which implies the theorem because every pseudo-polynomial-time
approximation algorithm for MinSU is a polynomial-time approximation
algorithm for MinAux. More precisely, we show that MinVC3 L-reduces [2] to
MinAux. We use the notation of the proof of Theorem 3.

From a graph to an instance of Aux. Let $\tau$ denote the minimum cardinality of
a vertex cover of $G$. Let $\nu$ denote the minimum cardinality of $\bigcup_{a \in A} (X_a + t_a)$
over all $(t_a)_{a \in A} \in \mathbb{Z}^A$. Clearly, $(X_a)_{a \in A}$ is computable from $G$ in polynomial
time ($(X_a)_{a \in A}$ is independent of $k$), $(X_a)_{a \in A}$ is an instance of MinAux, and
$\nu = |X_\circledast| + \tau = 4n + 8 + \tau$.

Now, assume that $G$ is cubic and $n \ge 24$. The first assumption implies
$3\tau \ge |E| \ge n$. It follows

$$4n + 8 = \left(4 + \frac{8}{n}\right) n \le \left(12 + \frac{24}{n}\right) \tau \le 13\tau ,$$

and thus $\nu \le 14\tau$.



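From a solution of Aux to a vertex cover. Let $(t_a)_{a \in A} \in \mathbb{Z}^A$. Put

$$k = \left| \bigcup_{a \in A} (X_a + t_a) \right| - |X_\circledast| .$$

There exists a vertex cover $C$ of $G$ that satisfies $|C| \le k$, or equivalently,

$$|C| - \tau \le \left| \bigcup_{a \in A} (X_a + t_a) \right| - \nu .$$

Moreover, such a vertex cover is computable from $G$ and $(t_a)_{a \in A}$ in polynomial
time:

- if $k \ge n$ then set $C = V$ and

- if $k < n$ then set $C = \bigcup_{e \in E} (X_e + t_e - t_\circledast) \setminus X_\circledast$.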

Conclusion. Let $\varepsilon$ be a positive real number. If MinSU is $(1 + \varepsilon)$-approximable
in pseudo-polynomial time then MinVC3 is $(1 + 14\varepsilon)$-approximable in polynomial
time. Therefore, if MinSU is $\frac{1387}{1386}$-approximable in pseudo-polynomial time then
P = NP. □
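The arithmetic behind the conclusion can be spelled out as follows. This is a sketch under stated assumptions: $(t_a)_{a \in A}$ is taken to be a $(1 + \varepsilon)$-approximate solution of MinSU, $C$ is the vertex cover extracted from it as described above, $n \ge 24$ so that $\nu \le 14\tau$, and $\frac{100}{99}$ is assumed to be the inapproximability threshold for MinVC3.

```latex
\begin{align*}
|C| - \tau
  &\le \Bigl| \bigcup_{a \in A} (X_a + t_a) \Bigr| - \nu
   \le \varepsilon \nu
   \le 14 \varepsilon \tau , \\
\text{hence} \quad |C| &\le (1 + 14 \varepsilon) \, \tau , \\
1 + 14 \varepsilon = \tfrac{100}{99}
  &\iff \varepsilon = \tfrac{1}{1386}
   \iff 1 + \varepsilon = \tfrac{1387}{1386} .
\end{align*}
```

The first inequality is the one established for the extracted vertex cover, the second is the approximation guarantee, and the third uses $\nu \le 14\tau$.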



An immediate corollary of Theorem 4 is that MinSU does not admit any
(pseudo-)polynomial time approximation scheme.



4 Open questions

The following three questions remain open: Does there exist a constant $\rho > 1$
such that MinSU is $\rho$-approximable in (pseudo-)polynomial time? Is SU
fixed-parameter tractable [9] with respect to parameter $|A|$? Is SU solvable in
polynomial time for bounded $\max_{a \in A} |X_a|$?